Abstract

We show how to turn a regular expression into an $O(s)$ space representation of McNaughton and
Yamada’s NFA, where $s$ is the number of NFA states. The standard adjacency list representation of
McNaughton and Yamada’s NFA takes up $s + s^2$ space in the worst case. The adjacency list representation
of the NFA produced by Thompson takes up between $2r$ and $5r$ space, where $r > s$ in general, and can
be arbitrarily larger than $s$. Given any set $T$ of NFA states, our representation can be used to compute
the set $N$ of states one transition away from the states in $T$ in optimal time $O(|T| + |N|)$. McNaughton
and Yamada’s NFA requires $O(|T| \times |N|)$ in the worst case. Using Thompson’s NFA, the equivalent
calculation requires $O(r)$ time in the worst case.

An  implementation  of  our  NFA  representation  confirms  that  it  takes  up  an  order  of  magnitude  less 
space  than  McNaughton  and  Yamada’s  machine.  An  implementation  to  produce  a  DFA  from  our  NFA 
representation  by  subset  construction  shows  linear  and  quadratic  speedups  over  subset  construction 
starting  from  both  Thompson’s  and  McNaughton  and  Yamada’s  NFA’s.  It  also  shows  that  the  DFA 
produced  from  our  NFA  is  as  much  as  one  order  of  magnitude  smaller  than  DFA’s  constructed  from  the 
two  other  NFA’s. 


1  Introduction 

The  growing  importance  of  regular  languages  and  their  associated  computational  problems  in  languages 
and  compilers  is  underscored  by  the  granting  of  the  Turing  Award  to  Rabin  and  Scott  in  1976,  in  part,  for 
their groundbreaking logical and algorithmic work in regular languages [16]. Of special significance was
their  construction  of  the  canonical  minimum  state  DFA  that  had  been  described  nonconstructively  in  the 
proof  of  the  Myhill-Nerode  Theorem[14,15].  Rabin  and  Scott’s  work,  which  was  motivated  by  theoretical 
considerations,  has  gained  in  importance  as  the  number  of  practical  applications  has  grown.  In  particular, 
the  construction  of  finite  automata  from  regular  expressions  is  of  central  importance  to  the  compilation 
of  communicating  processes[4],  string  pattern  matching[3],  model  checking[8],  lexical  scanning[2],  and  VLSI 
layout design[20]; unit time incremental acceptance testing in a DFA is also a crucial step in LR(k) parsing[12];
algorithms  for  acceptance  testing  and  DFA  construction  from  regular  expressions  are  implemented  in  the 
UNIX  operating  system[17]. 

Throughout this paper our model of computation is a uniform cost sequential RAM [1]. We report the
following  four  results. 

1.  Recently  Berry  and  Sethi[5]  used  results  of  Brzozowski[6]  to  formally  derive  and  improve  McNaughton 
and  Yamada’s  algorithm[13]  for  turning  regular  expressions  into  NFA’s.  NFA’s  produced  by  this 
algorithm  have  fewer  states  than  NFA’s  produced  by  Thompson’s  algorithm [18],  and  in  practice  they 
are  known  to  outperform  Thompson’s  NFA’s  for  acceptance  testing.  Berry  and  Sethi’s  algorithm  has 
two passes and can easily be implemented to run in time $O(m)$ and auxiliary space $O(r)$, where $r$ is
the length of the regular expression, and $m$ is the number of edges in the NFA produced. We present
an algorithm that computes the same NFA in a single left-to-right scan over the regular expression.
It runs in the same asymptotic time $O(m)$ as Berry and Sethi, but it improves the auxiliary space to
$O(s)$, where $s$ is the number of occurrences of alphabet symbols appearing in the regular expression.

2.  One  disadvantage  of  McNaughton  and  Yamada’s  NFA  is  that  its  worst  case  number  of  edges  is 
$m = O(s^2)$, which is also a worst case space bound for the standard adjacency list implementation.
Thompson’s NFA only has between $r$ and $2r$ states and between $r$ and $3r$ edges. We introduce a new
compressed representation for McNaughton and Yamada’s NFA that uses only $O(s)$ space. Our compressed
NFA can be constructed from a regular expression $R$ in $O(r)$ time and $O(s)$ auxiliary space.
It supports acceptance testing in worst-case time $O(s|x|)$ for arbitrary string $x$, and a promising new
way to construct DFA’s faster than the classical subset construction of Rabin and Scott.

3.  Our  main  theoretical  result  is  a  proof  that  the  compressed  NFA  can  be  used  to  compute  the  set  of 
states  N  one  edge  away  from  an  arbitrary  set  of  states  T  in  McNaughton  and  Yamada’s  NFA  in  optimal 
time $O(|T| + |N|)$. The previous best worst-case time is $O(|T| \times |N|)$.

4.  We  give  empirical  evidence  that  our  algorithm  for  NFA  acceptance  testing  using  the  compressed  NFA 
yields  a  constant  factor  speedup  over  acceptance  testing  using  Thompson’s  NFA,  and  is  comparable  to 
McNaughton  and  Yamada’s  NFA.  We  give  more  dramatic  empirical  evidence  that  constructing  a  DFA 
from  our  compressed  NFA  can  be  achieved  in  time  one  order  of  magnitude  faster  than  the  classical 
Rabin  and  Scott  subset  construction  (cf.  Chapter  3  of  [2])  starting  from  either  Thompson’s  NFA  or 
McNaughton and Yamada’s NFA. Our benchmarks also show subset construction being faster when it
starts  from  Thompson’s  machine  than  from  McNaughton  and  Yamada’s  NFA. 

The  next  two  sections  present  standard  terminology  and  background  material,  and  can  be  skipped  by 
anyone  who  knows  Chapter  3  of  [2].  Section  4  reformulates  McNaughton  and  Yamada’s  algorithm  from  an 
automata  theoretic  point  of  view.  Section  5  describes  our  new  algorithm  to  turn  a  regular  expression  into 
McNaughton  and  Yamada’s  NFA.  In  Section  6  we  show  how  to  construct  a  compressed  form  of  this  NFA. 
Analysis of the compressed NFA is presented in Theorem 6, which is our main theoretical result. In Section
7, we show how to further compress our NFA. Section 8 discusses experimental results showing how our
compressed  NFA  compares  with  other  NFA’s  in  solving  acceptance  testing  and  DFA  construction.  Section  9 
mentions  future  research. 


2  Terminology 

The following basic definitions and terminology can be found in [10]. By an alphabet we mean a finite
nonempty set of symbols. If $\Sigma$ is an alphabet, then $\Sigma^*$ denotes the set of all finite strings of symbols in $\Sigma$. If
$\Sigma$ is an alphabet, then any subset of $\Sigma^*$ is a language over $\Sigma$. If $L_1$ and $L_2$ are two languages, then the cross
product $L_1 L_2 = \{xy : x \in L_1, y \in L_2\}$ represents the set of all strings $xy$ that result from concatenating
each $x \in L_1$ with each $y \in L_2$. If $\lambda$ stands for the empty string, and $\emptyset$ represents the empty set, then
$L\{\lambda\} = \{\lambda\}L = L$ and $L\emptyset = \emptyset L = \emptyset$ for any language $L$.

Definition 1 Let $L_R$ be the language denoted by regular expression $R$. Let $\Sigma$ be a finite alphabet. Then the
regular expressions are the smallest set of terms that contains

• $\emptyset$ (which represents the empty set)

• $\lambda$ (which represents the set $\{\lambda\}$), where $\lambda$ is the empty string

• $a$ (which represents the set $\{a\}$) for each symbol $a \in \Sigma$

• $T|S$ (which represents the set $L_T \cup L_S$), where $T$ and $S$ are regular expressions

• $TS$ (which represents the cross product set $L_T L_S$), where $T$ and $S$ are regular expressions

• $T^*$ (which represents $\mathrm{lfp}\, X.(\{\lambda\} \cup L_T X)$, where $\mathrm{lfp}\, X.E(X)$ is the minimum value $X$ such that $X = E(X)$),
where $T$ is a regular expression.

A nondeterministic finite automaton (abbr. NFA) $M$ is a 5-tuple $(\Sigma, Q, I, F, \delta)$, where $Q$ is a set of
states, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the set of final states, and $\delta \subseteq Q \times (\Sigma \times Q)$ is a labeled
directed graph with vertices $Q$ and an edge labeled $a$ connecting state $q$ to state $p$ for every $[q, [a, p]]$ belonging
to $\delta$. For all $q \in Q$ and $a \in \Sigma$ we use the notation $\delta(q, a)$ to denote the set $\{p : [q, [a, p]] \in \delta\}$ of all states
reachable from state $q$ by a single edge labeled ‘a’. It is useful to generalize this notation with the following
rules, where $T \subseteq Q$, $s \in \Sigma^*$, and $B \subseteq \Sigma^*$:


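$$\delta(T, a) = \bigcup_{q \in T} \delta(q, a)$$

$$\delta(q, as) = \delta(\delta(q, a), s)$$

$$\delta(T, s) = \bigcup_{q \in T} \delta(q, s)$$

$$\delta(T, B) = \bigcup_{b \in B} \delta(T, b)$$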

The language $L$ accepted by $M$, denoted by $L(M)$, is defined by the rule

$$s \in L \iff \delta(I, s) \cap F \neq \emptyset \tag{1}$$

In other words, $L = \{s \in \Sigma^* \mid \delta(I, s) \cap F \neq \emptyset\}$. $M$ is a deterministic finite automaton (abbr. DFA) if graph
$\delta$ has no more than one edge with the same label leading out from each vertex, and if $I$ contains exactly one
state. Regular expressions and NFA’s that represent the same regular language are said to be equivalent.
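
The generalized transition map and rule (1) translate directly into code. The following Python sketch is only an illustration: the dictionary encoding of the transition graph and the names `step` and `accepts` are assumptions made here, not notation from the paper.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed encoding: (state, symbol) -> set of successor states.
Delta = Dict[Tuple[int, str], Set[int]]


def step(delta: Delta, T: Iterable[int], a: str) -> Set[int]:
    """delta(T, a): the union of delta(q, a) over all q in T."""
    out: Set[int] = set()
    for q in T:
        out |= delta.get((q, a), set())
    return out


def accepts(delta: Delta, initial: Set[int], final: Set[int], s: str) -> bool:
    """Rule (1): s is accepted iff delta(I, s) meets F."""
    current = set(initial)
    for a in s:  # delta(q, as) = delta(delta(q, a), s), one symbol at a time
        current = step(delta, current, a)
    return bool(current & final)


# A small hypothetical NFA accepting the strings over {a, b} that end in "ab".
EXAMPLE_DELTA: Delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```

With this encoding, `accepts(EXAMPLE_DELTA, {0}, {2}, "aab")` follows the state sets {0}, {0, 1}, {0, 1}, {0, 2} and reports acceptance because the last set contains the final state 2.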

3  Background 

Kleene  [11]  characterized  regular  languages  equivalently  in  terms  of  languages  denoted  by  regular  expressions 
and  languages  accepted  by  DFA’s.  Rabin  and  Scott  [16]  showed  that  NFA’s  also  characterize  the  regular 
languages,  and  their  work  led  to  algorithms  to  decide  whether  an  arbitrary  string  is  accepted  by  an  NFA. 

Let $n$ be the number of NFA states, $m$ be the number of edges, and $k$ be the alphabet size. For an NFA
represented by an adjacency matrix of size $n^2$ for each alphabet symbol, acceptance testing takes $O(n|x|)$
bit vector operations and $O(n)$ auxiliary space. Alternatively, for an NFA implemented by an adjacency
list of size $m$ with a perfect hash table [9] storing the alphabet symbols at each state, this test takes time
proportional to $m|x|$ in the worst case. For DFA’s the same data structure leads to a better time bound of
$\Theta(|x|)$. However, there are NFA’s for which the smallest equivalent DFA (unique up to isomorphism of state
labels  as  shown  by  Myhill  [14]  and  Nerode  [15])  has  an  exponentially  greater  number  of  states.  Thus,  the 
choice  between  using  an  NFA  or  DFA  is  a  space/time  tradeoff. 

There  are  two  main  approaches  for  turning  regular  expressions  into  equivalent  NFA’s.  One,  due  to 
Thompson [18], constructs an NFA (augmented with $\lambda$ edges) in which the number $n$ of states is somewhere
between the length $r$ of the regular expression and $2r$, and the outdegree of any state is no greater than 2.
Consequently $m = O(n)$, and the adjacency list implementation does not even require perfect hashing to
preserve the $O(n|x|)$ time bound. Thompson’s construction is a simple, bottom-up method that processes
the regular expression as it is parsed. The time and space are linear in $r$.

Another approach, based on Berry and Sethi’s [5] improvement to McNaughton and Yamada [13], constructs
an NFA in which the number $n$ of states is precisely one plus the number $s$ of occurrences of alphabet
symbols  appearing  in  the  regular  expression.  In  general,  s  can  be  arbitrarily  smaller  than  r. 
For the bit matrix representation, McNaughton and Yamada’s NFA can be used to solve acceptance
testing using $O(s|x|)$ bit vector operations, which is superior to the time bound for Thompson’s machine.
With the adjacency list representation the worst case number of edges $m = \Omega(s^2)$ leads to a worst case
time bound $O(m|x|)$ which is one order of magnitude worse than the time bound for Thompson’s machine.
However,  the  fact  that  McNaughton  and  Yamada’s  NFA  is  a  DFA  when  all  of  the  alphabet  symbols  are 
distinct  may  explain,  in  part,  why  it  is  observed  to  outperform  Thompson’s  NFA  for  a  large  subclass  of  the 
instances.  The  Berry/Sethi  construction  scans  the  regular  expression  twice,  and,  with  only  a  little  effort, 
both  passes  can  be  made  to  run  in  linear  time  and  auxiliary  space  with  respect  to  r  plus  the  size  of  the  NFA 
(for  either  adjacency  list  or  matrix  implementations). 

There  is  one  main  approach  for  turning  NFA’s  (constructed  by  either  of  the  two  methods  above)  into 
DFA’s.  This  is  by  the  Rabin  and  Scott  subset  construction  [16]. 

4  McNaughton  and  Yamada’s  NFA 

It  is  convenient  to  reformulate  McNaughton  and  Yamada’s  transformation  from  regular  expressions  to 
NFA’s [13]  in  the  following  way. 

Definition 2 A normal NFA (abbr. NNFA) is an NFA with one starting state $q_0$ having no edges leading
into it, and all edges leading into each state are labeled with the same symbol. For an NNFA with alphabet $\Sigma$
the transition map is represented by a binary edge relation $\delta \subseteq Q \times Q$ and assignment $A : (Q - \{q_0\}) \to \Sigma$,
where $A(q)$ is the label assigned to every edge leading into state $q$.

Definition 3 If $M = (\Sigma, Q, q_0, F, \delta, A)$ is an NNFA, then $tail(M) = (\Sigma, Q - \{q_0\}, \delta\{q_0\}, F - \{q_0\}, \{[q, t] \in \delta \mid q \neq q_0\}, A)$.

It  is  a  desirable  and  obvious  fact  (which  follows  immediately  from  the  definition  of  an  NNFA)  that  when 
A  is  one-to-one,  then  no  state  can  have  more  than  one  transition  with  the  same  label.  Hence,  such  an  NNFA 
is  a  DFA. 

We  can  implement  McNaughton  and  Yamada’s  algorithm  to  turn  a  regular  expression  R  into  an  NNFA 
while  performing  a  single  left-to-right  shift/reduce  parse  of  R  (but  without  actually  producing  a  parse  tree). 
To explain how this is done, we use the notational convention that $M_R$ denotes an NFA equivalent to regular
expression $R$. Each time a subexpression $S$ of $R$ is reduced during parsing, $tail(M_S)$ is computed, where $M_S$
is an NNFA equivalent to $S$. The last step computes an NNFA $M_R$ from $tail(M_R)$. However, $M_R$ cannot be
computed from $tail(M_R)$ unless we know whether $M_R$ accepts $\lambda$, which indicates whether or not the start
state for $M_R$ is a final state.

Regular expressions are restricted if $\emptyset$ is not a subexpression. There is a linear time algorithm to
convert regular expressions into their equivalent restricted forms. Without loss of generality, we will assume
throughout this paper that regular expressions are restricted.

Let $null_R = \{\lambda\}$ if $\lambda \in L_R$; otherwise, let $null_R = \emptyset$. If $tail(M_R) = (\Sigma, Q, I, F, \delta, A)$, and $q_0 \notin Q$, then
the following formula

$$M_R = (\Sigma, Q \cup \{q_0\}, q_0, F \cup (\{q_0\}\, null_R), \delta \cup \{[q_0, y] : y \in I\}, A) \tag{2}$$

indicates how to compute $M_R$ from $tail(M_R)$ and $null_R$.

Theorem 1 (McNaughton and Yamada) Given any regular expression $R$ with $s$ occurrences of alphabet
symbols from $\Sigma$, we can construct an NNFA $M_R$ with $s + 1$ states.

Proof The proof uses structural induction to show that for any regular expression $R$, we can always compute
$tail(M_R)$ and $null_R$ for some NNFA $M_R$. Then equation (2) can be used to obtain $M_R$. We assume a fixed
alphabet $\Sigma$. There are two base cases, which are easily verified.

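$$tail(M_\lambda) = (Q_\lambda = \emptyset, \delta_\lambda = \emptyset, A_\lambda = \emptyset, I_\lambda = \emptyset, F_\lambda = \emptyset, null_\lambda = \{\lambda\}) \tag{3}$$

$$tail(M_a) = (Q_a = \{q_0\}, I_a = \{q_0\}, F_a = \{q_0\}, \delta_a = \emptyset, A_a = \{[q_0, a]\}, null_a = \emptyset), \tag{4}$$

where $a \in \Sigma$, and $q_0$ is a new state.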

To use induction, we assume that $T$ and $S$ are two arbitrary regular expressions equivalent respectively to
NNFA’s $M_T$ and $M_S$ with $tail(M_T) = (Q_T, I_T, F_T, \delta_T, A_T)$ and $tail(M_S) = (Q_S, I_S, F_S, \delta_S, A_S)$, where $Q_T$
and $Q_S$ are disjoint. Then we can easily verify that

$$tail(M_{T|S}) = (Q_{T|S} = Q_T \cup Q_S,\ \delta_{T|S} = \delta_T \cup \delta_S,\ A_{T|S} = A_T \cup A_S,\ I_{T|S} = I_T \cup I_S,\ F_{T|S} = F_T \cup F_S,\ null_{T|S} = null_T \cup null_S) \tag{5}$$

$$tail(M_{TS}) = (Q_{TS} = Q_T \cup Q_S,\ \delta_{TS} = \delta_T \cup \delta_S \cup F_T I_S,\ A_{TS} = A_T \cup A_S,\ I_{TS} = I_T \cup null_T I_S,\ F_{TS} = F_S \cup null_S F_T,\ null_{TS} = null_T\, null_S) \tag{6}$$

$$tail(M_{T^*}) = (Q_{T^*} = Q_T,\ \delta_{T^*} = \delta_T \cup F_T I_T,\ A_{T^*} = A_T,\ I_{T^*} = I_T,\ F_{T^*} = F_T,\ null_{T^*} = \{\lambda\}) \tag{7}$$

Disjointness of the unions used to form the set of states for the cases $T|S$ and $TS$ proves the assertion about
the number of states. We can convert $tail(M_R)$ into $M_R$ using formula (2). □

The  proof  of  Theorem  1  leads  to  McNaughton  and  Yamada’s  algorithm.  The  construction  of  label  map 
A  shows  that  when  all  of  the  occurrences  of  alphabet  symbols  appearing  in  the  regular  expression  contain 
distinct  symbols,  then  A  is  one-to-one.  In  this  case,  a  DFA  would  be  produced. 
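
Rules (3) through (7), together with formula (2), can be read as a recursive procedure. The sketch below is a minimal Python rendering under stated assumptions: regular expressions are given as nested tuples (`("lam",)`, `("sym", a)`, `("alt", T, S)`, `("cat", T, S)`, `("star", T)`), states are integers, and the names `tail` and `nnfa` are chosen here for illustration. It uses ordinary set union everywhere, so it reproduces the construction but makes no claim about running time.

```python
from itertools import count, product


def tail(R, fresh):
    """Return (Q, I, F, delta, A, null) for tail(M_R), following rules (3)-(7).

    null is a boolean standing for null_R = {lambda} (True) or the empty set (False).
    """
    kind = R[0]
    if kind == "lam":                                   # rule (3)
        return set(), set(), set(), set(), {}, True
    if kind == "sym":                                   # rule (4): one new state
        q = next(fresh)
        return {q}, {q}, {q}, set(), {q: R[1]}, False
    if kind == "star":                                  # rule (7): overlapping union
        Q, I, F, d, A, _ = tail(R[1], fresh)
        return Q, I, F, d | set(product(F, I)), A, True
    QT, IT, FT, dT, AT, nT = tail(R[1], fresh)
    QS, IS, FS, dS, AS, nS = tail(R[2], fresh)
    if kind == "alt":                                   # rule (5)
        return QT | QS, IT | IS, FT | FS, dT | dS, {**AT, **AS}, nT or nS
    if kind == "cat":                                   # rule (6)
        I = IT | (IS if nT else set())
        F = FS | (FT if nS else set())
        d = dT | dS | set(product(FT, IS))
        return QT | QS, I, F, d, {**AT, **AS}, nT and nS
    raise ValueError(f"unknown node kind: {kind!r}")


def nnfa(R):
    """Formula (2): add a start state q0 = 0 in front of tail(M_R)."""
    Q, I, F, d, A, null = tail(R, count(1))
    q0 = 0
    return Q | {q0}, q0, F | ({q0} if null else set()), d | {(q0, y) for y in I}, A
```

For the expression (a|b)*ab, written as `("cat", ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a")), ("sym", "b"))`, `nnfa` returns five states (one per symbol occurrence plus the start state) and ten edges.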

Analysis determines that this algorithm falls short of optimal performance, because the operation $\delta_T \cup F_T I_T$
within formula (7) for $tail(M_{T^*})$ is not disjoint; all other unions are disjoint and can be implemented
in unit time. In particular, this overlapping union makes McNaughton and Yamada’s algorithm use time
$\Theta(m\sqrt{m}\log m)$ to transform regular expression

(8)

into an NNFA with $k + 1$ states and $m = k^2$ edges.


5  Faster  NFA  Construction 

By recognizing the overlapping union $\delta_T \cup F_T I_T$ within formula (7) for $tail(M_{T^*})$ as the source of inefficiency,
we can maintain invariant $nred_T = F_T I_T - \delta_T$ in order to replace the overlapping union by the equivalent
disjoint union $\delta_T \cup nred_T$. In order to maintain $nred_T$ as a component of the tail NNFA computation given
above, we can use the following recursive definition, obtained by simplifying expression $F_R I_R - \delta_R$ and using
the rules from the proof of Theorem 1.

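$$nred_\lambda = \emptyset \tag{9}$$

$$nred_a = F_a I_a, \text{ where } a \in \Sigma \tag{10}$$

$$nred_{T|S} = nred_T \cup nred_S \cup F_T I_S \cup F_S I_T \tag{11}$$

$$nred_{TS} = F_S I_T \cup null_S\, nred_T \cup null_T\, nred_S \tag{12}$$

$$nred_{T^*} = \emptyset \tag{13}$$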


Rules (9), (10) and (13) are trivial. Rule (11) follows from applying distributive laws to simplify formula

$$nred_{T|S} = (F_T \cup F_S)(I_T \cup I_S) - (\delta_T \cup \delta_S)$$

Rule (12) is obtained by applying distributive laws to simplify formula

$$nred_{TS} = (F_S \cup null_S F_T)(I_T \cup null_T I_S) - (\delta_T \cup \delta_S \cup F_T I_S)$$

Each union operation is disjoint and, hence, $O(1)$ time implementable. However, there is a serious loss
of efficiency computing cartesian products in rules (11) and (12). Such products do not contribute edges to
the NNFA for regular expressions $TS$ when these products belong to $nred_T$ and $null_S$ is empty, or when
they belong to $nred_S$ and $null_T$ is empty.

To overcome this problem we will use lazy evaluation to compute cartesian products only when they
actually contribute edges to the NNFA. Thus, instead of maintaining a union $nred_R$ of cartesian products,
we will maintain a set $lazynred_R$ of pairs of sets. Consequently, the overlapping union $\delta_T \cup F_T I_T$ within
formula (7) for $tail(M_{T^*})$ can be replaced by

$$\delta_T \cup \Big(\bigcup_{[A,B] \in lazynred_T} A \times B\Big) \tag{14}$$

However, this solution creates another problem: the sets forming $F$ and $I$, which are computed by the
rules to construct the tail of an NNFA, must be persistent in the following sense. Let the sets in the
sequence forming $F$ (respectively $I$) be called F-sets (respectively I-sets). Each F-set (respectively I-set)
could be stored as a first (respectively second) component of a pair belonging to $lazynred$. Given any such
pair, we need to iterate through the I-set $S$ stored in the second component of the pair in $O(|S|)$ time.

The  sequence  of  F-sets  (respectively  I-sets )  are  formed  by  two  operations:  1.  create  a  new  singleton  set; 
and  2.  form  a  new  set  by  taking  the  disjoint  union  of  two  previous  sets  in  the  sequence.  Clearly,  each  of 
these  sequences  can  be  stored  as  a  binary  forest  in  which  each  subtree  in  the  forest  represents  a  set  in  the 
sequence,  where  the  elements  of  the  set  are  stored  in  the  frontier.  By  construction  each  internal  node  in  the 
forest  has  two  children. 

We call the forest storing the F-sets (respectively I-sets) the F-forest (respectively I-forest). For each
node $n$ belonging to the F-forest (respectively I-forest), let $Fset(n)$ (respectively $Iset(n)$) denote the F-set
(respectively I-set) represented by $n$.

Each  node  in  the  F-forest  and  I-forest  except  the  roots  stores  a  parent  pointer.  Each  node  n  in  the 
I-forest also stores a pointer to the leftmost leaf of the subtree rooted at $n$ and a pointer to the rightmost
leaf of the subtree rooted at $n$. The frontier nodes of the I-forest are linked.

This data structure preserves the unit-time disjoint union for F-sets and I-sets, and supports linear time
iteration through the frontier of any node in the I-forest. Since all the F-sets and I-sets are subsets of the
NFA states $Q$, the F-forest and I-forest are each stored in $O(|Q|)$ space.
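
A minimal Python sketch of such an I-forest follows. It is an illustration under assumptions rather than the authors' implementation: the class and function names are invented here, and `union` is assumed to be applied only to two nodes that are currently roots, which is how the sets arise when the rules are applied bottom-up over the expression.

```python
class Node:
    """A node of a hypothetical I-forest; leaves carry NFA states."""

    def __init__(self, state=None):
        self.state = state   # set only on leaves
        self.parent = None   # parent pointer (None for roots)
        self.left = self     # leftmost leaf of the subtree rooted here
        self.right = self    # rightmost leaf of the subtree rooted here
        self.next = None     # next leaf in the linked frontier


def union(a, b):
    """Disjoint union of two root sets in unit time.

    A new root is created and the two frontiers are spliced together;
    the nodes a and b are left untouched, so the old sets persist.
    """
    root = Node()
    a.parent = b.parent = root
    root.left, root.right = a.left, b.right
    a.right.next = b.left
    return root


def members(n):
    """Iterate through the set represented by n in time linear in its size."""
    leaf = n.left
    while True:
        yield leaf.state
        if leaf is n.right:
            return
        leaf = leaf.next
```

Because `members` stops at the rightmost-leaf pointer of the node it was given, a set created early in the sequence can still be enumerated after it has been merged into larger sets, which is the persistence property required above.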

Theorem 2 For any regular expression $R$ we can compute $lazynred_R$ in time $O(r)$ and auxiliary space
$O(s)$, where $r$ is the size of regular expression $R$, and $s$ is the number of occurrences of alphabet symbols
appearing in $R$.

Proof If $T$ and $S$ are two sets, let $pair(T, S) = \{[T, S]\}$ if both $T$ and $S$ are nonempty; otherwise, let
$pair(T, S) = \emptyset$. The proof makes use of the following recursive definition of $lazynred_R$ obtained from the
recursive definition of $nred_R$.

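$$lazynred_\lambda = \emptyset \tag{15}$$

$$lazynred_a = pair(F_a, I_a), \text{ where } a \in \Sigma \tag{16}$$

$$lazynred_{T|S} = lazynred_T \cup lazynred_S \cup pair(F_T, I_S) \cup pair(F_S, I_T) \tag{17}$$

$$lazynred_{TS} = pair(F_S, I_T) \cup null_S\, lazynred_T \cup null_T\, lazynred_S \tag{18}$$

$$lazynred_{T^*} = \emptyset \tag{19}$$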


Operation $pair(T, S)$ takes unit time and space. Each union operation occurring in the rules above is disjoint
and, hence, implementable in unit time. Rule (16) contributes unit time and space for each alphabet symbol
occurring in $R$, or $O(s)$ time and space overall. Rule (17) contributes unit time for each alternation operator
appearing in $R$, or $O(r)$ time overall. It contributes two units of space for each alternation operator both of
whose alternands contain at least one alphabet symbol. Hence, the overall space contributed by this rule
is less than $2s$. By a similar argument, Rule (18) contributes $O(r)$ time and less than $s$ space overall. The
other two rules contribute no more than $O(r)$ time overall. Hence, the time and space needed to compute
$lazynred_R$ is $O(r)$ and $O(s)$ respectively. □
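
Rules (15) through (19) can be checked on small inputs with the following Python sketch. It is written under assumptions: expressions are nested tuples (`("lam",)`, `("sym", a)`, `("alt", T, S)`, `("cat", T, S)`, `("star", T)`), F-sets and I-sets are plain `frozenset` values, and the names `build` and `expand` are invented for this example. Frozensets do not give unit-time unions, so the sketch illustrates the rules and the invariant, not the stated time and space bounds, which rely on the forest representation.

```python
from itertools import count, product


def pair(X, Y):
    """pair(X, Y): a one-element set holding [X, Y] if both are nonempty, else empty."""
    return {(X, Y)} if X and Y else set()


def expand(lazy):
    """Formula (14): the union of the cartesian products A x B over all [A, B]."""
    return {e for X, Y in lazy for e in product(X, Y)}


def build(R, fresh):
    """Return (I, F, delta, null, lazynred) for tail(M_R)."""
    kind = R[0]
    if kind == "lam":                                            # rule (15)
        return frozenset(), frozenset(), set(), True, set()
    if kind == "sym":                                            # rule (16)
        q = frozenset({next(fresh)})
        return q, q, set(), False, pair(q, q)
    if kind == "star":                                           # rule (19)
        I, F, d, _, lz = build(R[1], fresh)
        return I, F, d | expand(lz), True, set()                 # disjoint union
    IT, FT, dT, nT, lzT = build(R[1], fresh)
    IS, FS, dS, nS, lzS = build(R[2], fresh)
    if kind == "alt":                                            # rule (17)
        lz = lzT | lzS | pair(FT, IS) | pair(FS, IT)
        return IT | IS, FT | FS, dT | dS, nT or nS, lz
    if kind == "cat":                                            # rule (18)
        lz = pair(FS, IT) | (lzT if nS else set()) | (lzS if nT else set())
        I = IT | (IS if nT else frozenset())
        F = FS | (FT if nS else frozenset())
        return I, F, dT | dS | set(product(FT, IS)), nT and nS, lz
    raise ValueError(f"unknown node kind: {kind!r}")
```

On every expression, expanding the returned pairs should give exactly the edge set $F_R I_R - \delta_R$, that is, `expand(lz) == set(product(F, I)) - d`.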

By Theorems 1 and 2, and by the fact that $nred_R$ can be computed from $lazynred_R$ in $O(|nred_R|)$ time
using formula (14), we have our first theoretical result.

Theorem 3 For any regular expression $R$ we can compute an equivalent NNFA with $s + 1$ states in time
$O(r + m)$ and auxiliary space $O(s)$, where $r$ is the size of regular expression $R$, $m$ is the number of edges in
the NNFA, and $s$ is the number of occurrences of alphabet symbols appearing in $R$.

6  Improving  Space  for  McNaughton  and  Yamada’s  NFA 

Theorem  3  leads  to  a  new  algorithm  that  computes  the  adjacency  form  of  the  NNFA  in  a  single  left-to-right 
shift/reduce  parse  of  the  regular  expression  R.  Although  this  improves  upon  the  algorithm  of  Berry  and 
Sethi, McNaughton and Yamada’s NNFA has certain theoretical disadvantages compared with the simpler Thompson’s
NFA. Recall from example (8) that the number of edges in McNaughton and Yamada’s machine can be the
square of the number of edges in Thompson’s machine (since Thompson’s NFA has $m = O(n)$). Consequently,
Thompson’s  NFA  is  likely  to  be  more  desirable  in  time  and  space  for  DFA  construction  by  subset  construction 
when  the  adjacency  list  implementation  is  used.  We  also  believe  that  the  bit  vector  implementation  will 
rarely  be  more  desirable  than  the  compact  adjacency  list  implementation. 

Nevertheless, we can modify the algorithm just given so that in $O(r)$ time it produces an $O(s)$ space
compressed NFA that encodes McNaughton and Yamada’s NNFA, and that supports acceptance testing in
$O(s|x|)$ time. In the same way that $nred_R$ was represented more compactly as $lazynred_R$, we can represent
$\delta_R$, which is a union of cartesian products, as a set $lazy\delta_R$ of pairs of set-valued arguments of these products.
If $M_R$ is the NNFA equivalent to regular expression $R$, then the rules for $tail(M_R)$ are given just below:


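$$lazy\delta_\lambda = \emptyset \tag{20}$$

$$lazy\delta_a = \emptyset \tag{21}$$

$$lazy\delta_{T|S} = lazy\delta_T \cup lazy\delta_S \tag{22}$$

$$lazy\delta_{TS} = pair(F_T, I_S) \cup lazy\delta_T \cup lazy\delta_S \tag{23}$$

$$lazy\delta_{T^*} = lazy\delta_T \cup lazynred_T \tag{24}$$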

After the preceding rules are processed we can obtain a representation for $M_R$ by introducing a new state
$q_0$ and by adding tuple $[q_0, I_R]$ to $lazy\delta$ in accordance with formula (2).


Consequently, if $T$ is a subset of the NFA states $Q$, then we can compute the collection of sets $\delta(T, a)$ for
all of the alphabet symbols $a \in \Sigma$ as follows. First we compute

$$finddomain(T) = \{X : [X, Y] \in lazy\delta \mid T \cap X \neq \emptyset\}$$

which is used to find the set of next states

$$next\_states(T) = \{Y : [X, Y] \in lazy\delta \mid X \in finddomain(T)\}$$

Finally, for each alphabet symbol $a \in \Sigma$, we see that

$$\delta(T, a) = \{q : Y \in next\_states(T), q \in Y \mid A(q) = a\}$$
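
The three queries can be written down naively as follows. This Python sketch is an assumption-laden illustration: $lazy\delta$ is stored as a flat list of pairs of frozensets, the label map $A$ as a dictionary, and every query scans the whole list, so it shows what is computed but does not achieve the time bounds discussed in this section. The example data is the set of pairs that rules (15) through (24) produce by hand for (a|b)*ab with states 1 to 4 and start state 0.

```python
def find_domain(lazy_delta, T):
    """finddomain(T): first components X of pairs [X, Y] with T and X not disjoint."""
    return {X for X, _ in lazy_delta if X & T}


def next_states(lazy_delta, T):
    """next_states(T): second components Y of pairs whose X lies in finddomain(T)."""
    dom = find_domain(lazy_delta, T)
    return {Y for X, Y in lazy_delta if X in dom}


def delta_all(lazy_delta, A, T):
    """Return {a: delta(T, a)} for every symbol a that labels a reachable state."""
    out = {}
    for Y in next_states(lazy_delta, T):
        for q in Y:
            out.setdefault(A[q], set()).add(q)
    return out


fs = frozenset
# Hand-derived lazy-delta for (a|b)*ab; state 0 is the start state q0.
LAZY = [
    (fs({1}), fs({1})), (fs({2}), fs({2})), (fs({1}), fs({2})), (fs({2}), fs({1})),
    (fs({1, 2}), fs({3})), (fs({3}), fs({4})), (fs({0}), fs({1, 2, 3})),
]
LABEL = {1: "a", 2: "b", 3: "a", 4: "b"}
```

Seven pairs stand in for the ten edges of the uncompressed NNFA. Calling `delta_all(LAZY, LABEL, fs({0}))` groups the initial states 1, 2 and 3 by their labels.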

In order to explain how $lazy\delta$ is implemented, we will use some additional terminology. For each F-set
$S$ represented by node $n$ in the F-forest, $n$ stores a pointer to a list of nodes in the I-forest representing set
$\{Y : [S, Y] \in lazy\delta\}$. Furthermore, the F-forest and I-forest are compressed to only store nodes representing
sets that appear as the first or second components of a pair $[X, Y] \in lazy\delta$. In other words, we make $lazy\delta$
a total onto binary relation. This can be achieved on-line as the F-forest and I-forest are constructed by a
kind of path compression that affects the preprocessing time and space by no more than a small constant
factor. Thus, we have

Theorem 4 For any regular expression $R$, its equivalent compressed NFA, consisting of F-forest, I-forest
and $lazy\delta$, takes up $O(s)$ space and can be computed in time $O(r)$ and auxiliary space $O(s)$.

Proof Since each internal node in the F-forest and I-forest has at least two children, and since their leaves
are distinct occurrences of alphabet symbols, they take up $O(s)$ space. Each of the unions in the rules to
compute $lazy\delta$ is disjoint, and hence takes unit time. By the same argument used to analyze the overall
space contributed by Rule (18) in the proof of Theorem 2, we see that Rule (23) contributes $O(s)$ space and
$O(r)$ time overall to $lazy\delta$. By Rule (19), Theorem 2, and a simple application of structural induction, we also
see that the space contributed by Rule (24) (which results from adding $lazynred$ to $lazy\delta$) overall is $O(s)$.
The overall time bound for each rule is easily seen to be $O(r)$. □

The compressed NFA also supports an efficient evaluation of the three preceding queries in order to
simulate transition map $\delta$. The best previous worst case time bound for taking a subset $T$ of states as input and
computing the collection of sets $\delta(T, a)$ for all of the alphabet symbols $a \in \Sigma$ is $O(|T| \times |\delta(T, \Sigma)|)$ using an
adjacency list implementation of McNaughton and Yamada’s NFA, or $O(r)$ using Thompson’s NFA.

In  Theorem  6  we  improve  this  bound,  and  obtain,  essentially,  optimal  asymptotic  time  without  exceeding 
O(s)  space.  This  is  our  main  theoretical  result.  It  explains  the  apparent  superior  performance  of  acceptance 
testing  using  our  compressed  NFA  over  Thompson’s.  It  explains  more  convincingly  why  constructing  a  DFA 
starting  from  our  compressed  NFA  is  at  least  one  order  of  magnitude  faster  than  when  we  start  from  either 
Thompson’s or McNaughton and Yamada’s NFA. These empirical results are presented in Section 8.

Before proving the theorem, we will first prove the following technical lemma.


Lemma 5 Let $T$ be a set of states in the compressed NNFA built from regular expression $R$, and $lazy\delta_T =
\{[X, Y] : [X, Y] \in lazy\delta \mid X \cap T \neq \emptyset\}$. Then $|lazy\delta_T| = O(|T| + |\delta(T, \Sigma)|)$.

Proof The result follows from proving that $O(|T| + |\delta(T, \Sigma)|)$ is a bound for each of the subsets of $lazy\delta_T$
contributed by rules (16), (17), (18), and (23) respectively. The bound holds for subsets contributed by rules
(16), (17), and (23), because they form one-to-one maps.

The proof for the subset contributed by (18) is split into two cases. For convenience, let $T_Q$ denote the
set of states in $T$ such that their corresponding symbol occurrences appear in regular expression $Q$, where
$Q$ is a subexpression of $R$. First, consider the set $A$ of pairs $[F_S, I_Q] \in lazy\delta_T$ for subexpressions $QS$, where
$T_Q = \emptyset$. We claim that these edges form a one-to-many map, which implies the bound. Suppose this were
not the case. Then we would have a subexpression $QS$, and a subexpression $LP$ of $Q$ such that $I_Q = I_L$ and
pairs $[F_S, I_Q]$ and $[F_P, I_L]$ belonging to $A$. However, since $Q$ contains no occurrence of an alphabet symbol
in $T$, then $P$ does not either. Hence, the pair $[F_P, I_L]$ cannot belong to $A$. Hence, the claim holds.

Next, consider the set $B$ of pairs $[F_S, I_Q] \in lazy\delta_T$ for subexpressions $QS$, where $T_Q \neq \emptyset$. Proceeding
from inner-most to outer-most subexpression $QS$, we charge each pair $[F_S, I_Q] \in B$ to an uncharged state
in $T_Q$. A simple structural induction would show that $T_Q$ contains at least one uncharged state. Let $LP$
be an inner-most subexpression in $R$ such that $T_L$ is nonempty, and $[F_P, I_L] \in lazy\delta_T$. Then both $T_L$ and
$T_P$ contain at least one uncharged state. After an uncharged state in $T_L$ is charged, $T_{LP}$ still contains an
uncharged state from $T_P$. The inductive step is similar. The result follows. □

Theorem 6 Given any subset $T$ of the NNFA states, we can compute all of the sets $\delta(T, a)$ for every alphabet
symbol $a \in \Sigma$ in time $O(|T| + |\delta(T,\Sigma)|)$.

Proof The sets belonging to finddomain(T) are represented by all the nodes $P_T$ along the paths from the
states belonging to $T$ to the roots of the F-forest. These nodes $P_T$ can be found in $O(|T| + |P_T|)$ time by a
marked traversal of parent pointers in the forest. Observe that $|P_T|$ can be much larger than $|T|$.

Computing next_states(T) involves two steps. First, for each node $n \in P_T$, we traverse a nonempty list
of nodes in the I-forest representing $\{Y : [Fset(n), Y] \in \mathit{lazy}\delta\}$. This step takes time linear in the sum of
the lengths of these lists. (Observe that this number can be much larger than $|P_T|$.) Second, if $D_T$ is the
set of all nodes in the I-forest belonging to these lists, then next_states(T) $= \{Iset(n) : n \in D_T\}$. We
can compute the set next_states(T) in $O(|\{[Fset(n), Y] : n \in P_T, [Fset(n), Y] \in \mathit{lazy}\delta\}|)$ time, which is
$O(|T| + |\delta(T,\Sigma)|)$ time by Lemma 5.

Calculating $\delta(T,\Sigma)$ involves computing the union of the sets belonging to next_states(T). This is achieved
in $O(|\delta(T,\Sigma)|)$ time using the left and right descendant pointers stored in each node belonging to $D_T$, traversing
the unmarked leaves in the frontier, and marking leaves as they are traversed. Multiset discrimination
[7] can be used to separate out all of the sets $\{q \in \delta(T,\Sigma) \mid A(q) = a\}$ for each $a \in \Sigma$ in time $O(|\delta(T,\Sigma)|)$.

□ 
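
The two bookkeeping steps used in the proof can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `marked_ancestors`, `split_by_symbol`, `parent`, and `symbol_of` are hypothetical, the forest is given as a plain parent map, and the grouping step uses ordinary dictionary buckets as a stand-in for multiset discrimination [7].

```python
from collections import defaultdict

def marked_ancestors(parent, states):
    """Collect every node on a path from a state in `states` to its root.

    `parent` maps a node to its parent (roots map to None). A node is marked
    the first time it is reached, so each walk stops as soon as it meets an
    already-marked node and no parent pointer is followed twice.
    """
    marked = set()
    for q in states:
        node = q
        while node is not None and node not in marked:
            marked.add(node)
            node = parent[node]
    return marked

def split_by_symbol(states, symbol_of):
    """Group states by the alphabet symbol labelling them (hash buckets)."""
    buckets = defaultdict(set)
    for q in states:
        buckets[symbol_of[q]].add(q)
    return dict(buckets)

# Hypothetical forest: leaves 1..4, internal nodes "u" and "v", root "r".
parent = {1: "u", 2: "u", 3: "v", 4: "v", "u": "r", "v": "r", "r": None}
print(marked_ancestors(parent, [1, 2]))
print(split_by_symbol([1, 2, 3], {1: "a", 2: "b", 3: "a"}))
```

Because marking prevents revisiting a node, the traversal work is proportional to the number of states plus the number of distinct nodes collected, which mirrors the $O(|T| + |P_T|)$ bound quoted above.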

Consider  an  NFA  constructed  from  the  following  regular  expression: 

k  s 

(A|(A|(...(A|a)T)'-0T 

In order to follow transitions labeled ’a’, we have to examine $O(n^2)$ edges and $O(n)$ states in $O(n^2)$ time
for McNaughton and Yamada’s NFA, $\Theta(kn)$ states and edges in $\Theta(kn)$ time for Thompson’s machine, and
$O(n)$ states and edges in $O(n)$ time for our compressed NFA.

7  Further  Optimization 

In this section, we introduce a simple transformation that can greatly improve the compressed NFA
representation. If $\mathit{lazy}\delta$ contains both $[R,U]$ and $[S,U]$, and if there exists an F-set $T = R \cup S$, then we can
replace $[R,U]$ and $[S,U]$ within $\mathit{lazy}\delta$ by a single pair $[T,U]$. Similarly, if $\mathit{lazy}\delta$ contains both $[U,R]$ and
$[U,S]$, and if there exists an I-set $T = R \cup S$, then we can replace $[U,R]$ and $[U,S]$ within $\mathit{lazy}\delta$ by a single
pair $[U,T]$. We call this technique packing. In a single linear time bottom-up traversal of the I-forest and
the F-forest, we can simplify $\mathit{lazy}\delta$ by packing. In the case of regular expression $(a_1|a_2|\cdots|a_n)^*$, packing can
simplify $\mathit{lazy}\delta$ from $3n - 2$ pairs into a single pair (see Figure 1).
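
The packing rule can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the linear time bottom-up traversal described above: it applies the rule by naive repeated search over explicit sets, and the names `pack`, `pack_once`, `f_sets`, and `i_sets` are hypothetical.

```python
from itertools import combinations

def pack_once(pairs, f_sets, i_sets):
    """Apply the packing rule to one mergeable couple of pairs, if any.

    Returns the new set of pairs, or None when no rule applies.
    """
    for (r, u), (s, v) in combinations(list(pairs), 2):
        # [R,U] and [S,U] with an F-set T = R | S  ->  [T,U]
        if u == v and (r | s) in f_sets:
            return (pairs - {(r, u), (s, v)}) | {(r | s, u)}
        # [U,R] and [U,S] with an I-set T = R | S  ->  [U,T]
        if r == s and (u | v) in i_sets:
            return (pairs - {(r, u), (s, v)}) | {(r, u | v)}
    return None

def pack(pairs, f_sets, i_sets):
    """Repeat the rule until no couple of pairs can be merged."""
    pairs = set(pairs)
    while True:
        packed = pack_once(pairs, f_sets, i_sets)
        if packed is None:
            return pairs
        pairs = packed

fs = frozenset
# Hypothetical two-state instance: F-sets {1}, {2}, {1,2}; I-set {1,2}.
f_sets = {fs({1}), fs({2}), fs({1, 2})}
i_sets = {fs({1, 2})}
pairs = {(fs({1}), fs({1, 2})), (fs({2}), fs({1, 2}))}
packed = pack(pairs, f_sets, i_sets)
print(packed)
```

Every successful merge removes two pairs and adds at most one, so the loop terminates; the quadratic search is only for clarity.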


[Figure 1 panels: F-tree, I-tree]

Figure 1: Compressed NFA of $(a_1|a_2|\cdots|a_n)^*$ after packing F-sets and I-sets.

At the same time, we can carry out the same kind of path compression described in Section 6, so that
the F-forest and I-forest only contain nodes in the domain (respectively range) of $\mathit{lazy}\delta$. However, whereas
previously the forest leaves (corresponding to NFA states) were unaffected by compression, the packing
transformation can remove leaves in the F-forest and I-forest from the domain and (respectively) range of
$\mathit{lazy}\delta$. When path compression eliminates leaves, we need to turn the symbol assignment map $A$ into a
multi-valued mapping; that is, whenever leaves $q_1, \ldots, q_k$ are replaced by leaf $q$, we take the following steps:

• remove the old leaves $q_1, \ldots, q_k$ from the domain of $A$;

• assign the set of symbols $\{x : s \in \{q_1, \ldots, q_k\}, [s, x] \in A\}$ to $A$ at $q$.
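
The two steps above can be sketched in Python as follows. This is a minimal illustration under the assumption that the symbol assignment map $A$ is stored as a dictionary from leaves to sets of symbols; the names `merge_leaves` and `can_take` are hypothetical.

```python
def merge_leaves(symbols, old_leaves, new_leaf):
    """Return a copy of the multi-valued symbol map in which `old_leaves`
    are removed and `new_leaf` is labelled with the union of their symbols."""
    old = set(old_leaves)
    merged = {q: set(syms) for q, syms in symbols.items() if q not in old}
    merged[new_leaf] = set().union(*(symbols[q] for q in old))
    return merged

def can_take(symbol, leaf, symbols):
    """A transition into `leaf` is allowed only if `symbol` labels it."""
    return symbol in symbols[leaf]

# Hypothetical leaves q1..q3 labelled a1..a3, replaced by one leaf q.
A = {"q1": {"a1"}, "q2": {"a2"}, "q3": {"a3"}}
A2 = merge_leaves(A, ["q1", "q2", "q3"], "q")
print(A2)
print(can_take("a2", "q", A2))
```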

As an example of this, consider regular expression $(a_1|a_2|\cdots|a_n)^*$ once again. Path compression will
turn the data structure shown in Figure 1 into the one depicted in Figure 2. In using our compressed
representation to simulate an NFA, the transition edge $t$ (see Figure 2) can be taken only if the current
transition symbol belongs to $\{a_1, a_2, \ldots, a_n\}$, which labels node $C_1$.

[Figure 2 panels: F-tree, I-tree]

Figure 2: Compressed NFA of $(a_1|a_2|\cdots|a_n)^*$ after Packing and Path Compression.

Packing and path compression can not only speed up acceptance testing but also improve DFA construction
dramatically. In the remainder of this paper, we call our optimized compressed NFA representation CNFA.

8  Performance  Benchmark 

Experiments to benchmark the performance of the CNFA have been carried out for a range of regular
expression patterns against a number of machines including Thompson’s NFA, an optimized form of Thompson’s
NFA, and McNaughton and Yamada’s NFA [13]. We build Thompson’s NFA according to the construction
rules described in [2]. Thompson’s NFA usually contains many redundant states and λ-edges. However,
to our knowledge there is no obvious or efficient algorithm to optimize Thompson’s NFA without violating
the linear space constraint. We therefore devised some simple but effective transformations that eliminate
redundant states and edges in most of the test cases.

Our acceptance testing experiments show that the CNFA outperforms Thompson’s NFA, Thompson’s
NFA optimized, and McNaughton and Yamada’s NFA. For regular expression $(a|b|\cdots)^*$, the CNFA is 12
times faster than Thompson’s NFA, 2 times faster than Thompson’s NFA optimized, and 50% faster than
McNaughton and Yamada’s NFA. For regular expression $((a|\lambda)(b|\lambda)\cdots)^*$, which is equivalent to $(a|b|\cdots)^*$,
the CNFA is 16 times faster than Thompson’s NFA, 8 times faster than Thompson’s NFA optimized, and 50%
faster than McNaughton and Yamada’s NFA. For regular expression $((a|\lambda)(b|\lambda)\cdots -)^*$, which accepts zero
or more instances of an ordered string followed by a ’-’, the CNFA is 2 times faster than Thompson’s NFA,
25% faster than Thompson’s NFA optimized, but 80% slower than McNaughton and Yamada’s NFA. For
$((a|\lambda)^n -)^*$, the CNFA is comparable to Thompson’s machine, 50% slower than Thompson’s NFA optimized,
and linearly faster than McNaughton and Yamada’s NFA¹. For $(abc\cdots)$ and $(abc\cdots)^*$, the CNFA is 75%
slower than Thompson’s NFA and McNaughton and Yamada’s NFA, and 55% slower than Thompson’s NFA
optimized. However, acceptance testing for concatenation is quite fast for each of the NFA’s being compared,
and would not degrade our speedup ratio in general. Acceptance testing with a realistic programming
language pattern shows that the CNFA is 7 times faster than Thompson’s NFA, 60% faster than Thompson’s
NFA optimized, and 2 times faster than McNaughton and Yamada’s NFA.
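
For readers unfamiliar with acceptance testing, the following Python sketch shows the basic state-set simulation on a small λ-free NFA whose states carry symbol labels, in the style of McNaughton and Yamada's machine. It is an illustration only, not the benchmarked code; the function `accepts` and the three-state example machine for $(a|b)^*$ are assumptions made for this sketch.

```python
def accepts(text, start, edges, symbol_of, finals):
    """Simulate a lambda-free NFA in which each non-start state is labelled
    by the symbol consumed on entering it.

    `edges` maps a state to the states one transition away; a step keeps
    only those successors whose label equals the current input symbol.
    """
    current = {start}
    for ch in text:
        current = {q for p in current
                   for q in edges.get(p, ())
                   if symbol_of[q] == ch}
        if not current:
            return False
    return bool(current & finals)

# Hypothetical machine for (a|b)*: state 0 is the start state,
# state 1 is labelled 'a' and state 2 is labelled 'b'.
edges = {0: {1, 2}, 1: {1, 2}, 2: {1, 2}}
symbol_of = {1: "a", 2: "b"}
finals = {0, 1, 2}

print(accepts("abba", 0, edges, symbol_of, finals))
print(accepts("abc", 0, edges, symbol_of, finals))
```

Each input symbol costs time proportional to the edges leaving the current state set, which is the quantity the different NFA representations compared above try to reduce.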

The benchmark for subset construction is more favorable. The CNFA outperforms the other machines
not only in DFA construction time but also in constructed machine size. Subset construction is compared
among the following five starting machines: the CNFA, Thompson’s NFA, Thompson’s NFA optimized,
Thompson’s NFA using the important-state heuristic [2], and McNaughton and Yamada’s NFA. Below is a high
level modified specification of the classical Rabin and Scott subset construction for producing a DFA σ from
an NFA δ:

    σ := ∅
    workset := {{q0}}
    while ∃ S ∈ workset do
        workset := workset − {S}
        for each symbol a ∈ Σ and set of states B = {x ∈ δ(S, Σ) | A(x) = a} where B ≠ ∅ do
            B := ε-closure(B)
            σ(S, a) := B
            if B does not belong to σ then
                workset := workset ∪ {B}
            end if
        end for
    end while
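
A runnable counterpart of this specification is sketched below in Python for the λ-free case, so the ε-closure step is omitted. It is a plain illustration rather than the tailored implementations that were benchmarked: the function name `subset_construction`, the explicit `seen` set used to avoid re-queuing a DFA state, and the example machine for $(a|b)^*$ are assumptions of this sketch.

```python
from collections import defaultdict

def subset_construction(start, edges, symbol_of):
    """Build DFA transitions from a lambda-free, state-labelled NFA.

    Returns (dfa, states) where dfa maps (S, a) to the successor set B
    and states is the set of DFA states (frozensets of NFA states).
    """
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    workset = [start_set]
    while workset:
        S = workset.pop()
        # Split delta(S, Sigma) into one bucket per symbol.
        buckets = defaultdict(set)
        for p in S:
            for q in edges.get(p, ()):
                buckets[symbol_of[q]].add(q)
        for a, members in buckets.items():
            B = frozenset(members)
            dfa[(S, a)] = B
            if B not in seen:
                seen.add(B)
                workset.append(B)
    return dfa, seen

# Hypothetical machine for (a|b)*: start state 0, state 1 labelled 'a',
# state 2 labelled 'b'.
edges = {0: {1, 2}, 1: {1, 2}, 2: {1, 2}}
symbol_of = {1: "a", 2: "b"}
dfa, states = subset_construction(0, edges, symbol_of)
print(len(states), len(dfa))
```

On this two-symbol example the construction yields three DFA states and six transitions.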


We implemented the preceding specification tailored to the CNFA and other machines. The only
differences in these implementations are in the calculation of $\delta(S,\Sigma)$, where we use the efficient procedure described
by Theorem 6, and in the ε-closure step, which is performed only by Thompson’s NFA. The CNFA achieves
linear speedup and constructs a linearly smaller DFA in many of the test cases. See Figures 3 and 4 for a
benchmark summary. The raw timing data is given in the Appendix. All the tests described in this paper
are performed on a lightly loaded SUN 3/250 server. We used the getitimer() and setitimer() primitives
[19] to measure program execution time. It is interesting to note that the CNFA has a better speedup ratio
on SUN Sparc based computers.

¹ McNaughton and Yamada’s NFA suffers from a quadratic number of edges in this test pattern.


pattern               TNFA               TNFA (imp. state)  opt. TNFA          MYNFA
(abc...)*             5 times faster     comparable         comparable         comparable
(a|b|...)*            quadratic speedup  linear speedup     linear speedup     linear speedup
(0|1|...|9)^n         70 times faster    10 times faster    20% faster         10 times faster
((a|λ)(b|λ)...-)*     linear speedup     20% faster         linear speedup     5% faster
((a|λ)(b|λ)...)*      quadratic speedup  linear speedup     quadratic speedup  linear speedup
(a|b)*a(a|b)^n        2.5 times faster   comparable         10% slower         50% faster
programming language  800 times faster   6 times faster     60% faster         6 times faster

Figure 3: CNFA Subset Construction Speedup Ratio


pattern               TNFA               TNFA (imp. state)  opt. TNFA          MYNFA
(abc...)*             comparable         comparable         comparable         comparable
(a|b|c...)*           linearly smaller   linearly smaller   comparable         linearly smaller
(0|1|...|9)^n         200 times smaller  10 times smaller   comparable         10 times smaller
((a|λ)(b|λ)...-)*     3 times smaller    comparable         comparable         comparable
((a|λ)(b|λ)...)*      linearly smaller   linearly smaller   linearly smaller   linearly smaller
(a|b)*a(a|b)^n        4 times smaller    comparable         comparable         comparable
programming language  10 times smaller   5 times smaller    20% larger         5 times smaller

Figure 4: DFA Size Improvement Ratio Starting from the CNFA


9  Conclusion 

Theoretical analysis and confirming empirical evidence demonstrate that our proposed CNFA leads to a
substantially more efficient way of turning regular expressions into DFA’s (and minimum state DFA’s in
particular) than other NFA’s in current use. It would be interesting future research to analyze the effect of
packing and path compression on the CNFA. It would also be worthwhile to obtain a sharper analysis of the
constant factors in comparing the CNFA with other NFA’s.




References 


[1] Aho, A., Hopcroft, J. and Ullman, J., “The Design and Analysis of Computer Algorithms”, Reading,
Addison-Wesley, 1974.

[2] Aho, A., Sethi, R. and Ullman, J., “Compilers: Principles, Techniques, and Tools”, Reading,
Addison-Wesley, 1986.

[3]  Aho,  A.,  “Pattern  Matching  in  Strings”,  in  Formal  Language  Theory,  ed.  R.  V.  Book,  Academic  Press, 
Inc.  1980. 

[4]  Berry,  G.  and  Cosserat,  L.,  “The  Esterel  synchronous  programming  language  and  its  mathematical 
semantics”  in  Seminar  in  Concurrency,  S.  D.  Brookes,  A.  W.  Roscoe,  and  G.  Winskel,  eds.,  LNCS  197, 
Springer-Verlag,  1985. 

[5] Berry, G. and Sethi, R., “From Regular Expressions to Deterministic Automata”, Theoretical Computer
Science, 48 (1986), pp. 117-126.

[6]  Brzozowski,  J.,  “Derivatives  of  Regular  Expressions”,  JACM,  Vol.  11,  No.  4.,  Oct.  1964,  pp.  481-494. 

[7]  Cai,  J.  and  Paige,  R.,  “Look  Ma,  No  Hashing,  And  No  Arrays  Neither”,  ACM  POPL,  Jan.  1991,  pp. 
143  -  154. 

[8]  Emerson,  E.  and  Lei,  C.,  “Model  Checking  in  the  Propositional  Mu-Calculus”,  Proc.  IEEE  Conf.  on 
Logic  in  Computer  Science,  1986,  pp.  86  -  106. 

[9]  Fredman,  M.,  Komlos,  J.,  and  Szemeredi,  E.,  “Storing  a  Sparse  Table  with  0(1)  Worst  Case  Access 
Time”,  JACM,  vol.  31,  no.  3,  pp.  538-544,  July,  1984. 

[10] Hopcroft, J. and Ullman, J., “Formal Languages and Their Relation to Automata”, Reading,
Addison-Wesley, 1969.

[11]  Kleene,  S.,  “Representation  of  events  in  nerve  nets  and  finite  automata”,  in  Automata  Studies,  Ann. 
Math.  Studies  No.  34,  Princeton  U.  Press,  1956,  pp.  3  -  41. 

[12]  Knuth,  D.,  “On  the  translation  of  languages  from  left  to  right”,  Information  and  Control,  Vol.  8,  Num. 
6,  1965,  pp.  607  -  639. 

[13] McNaughton, R. and Yamada, H., “Regular Expressions and State Graphs for Automata”, IRE Trans.
on Electronic Computers, Vol. EC-9, No. 1, Mar. 1960, pp. 39-47.

[14]  Myhill,  J.,  “Finite  automata  and  representation  of  events,”  WADC,  Tech.  Rep.  57-624,  1957. 

[15]  Nerode,  A.,  “Linear  automaton  transformations,”  Proc.  Amer.  Math  Soc.,  Vol.  9,  pp.  541  -  544,  1958. 




[16] Rabin, M. and Scott, D., “Finite automata and their decision problems”, IBM J. Res. Develop., Vol. 3,
No. 2, Apr. 1959, pp. 114 - 125.

[17] Ritchie, D. and Thompson, K., “The UNIX Time-Sharing System”, Communications of the ACM, Vol. 17, No.
7, Jul. 1974, pp. 365 - 375.

[18] Thompson, K., “Regular Expression Search Algorithm”, Communications of the ACM 11:6 (1968), pp. 419-422.

[19]  “SunOS  Reference  Manual  VOL.  II”,  Programmer’s  Manual,  SUN  microsystems,  1989. 

[20] Ullman, J., “Computational Aspects of VLSI”, Computer Science Press, 1984.




APPENDIX:  Benchmark  Results  2 


Acceptance  Testing  Benchmark 


(abc...)

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1000 | 0.14 | 0.34 | 0.58 | 0.76 | 0.18 | 0.18 | 0.45 | 0.24 | 0.76
1500 | 0.20 | 0.52 | 1.00 | 1.18 | 0.30 | 0.18 | 0.44 | 0.25 | 0.84
2000 | 0.30 | 0.74 | 1.32 | 1.58 | 0.42 | 0.19 | 0.47 | 0.27 | 0.84
2500 | 0.38 | 0.90 | 1.60 | 2.00 | 0.54 | 0.19 | 0.45 | 0.27 | 0.80
3000 | 0.44 | 1.12 | 2.00 | 2.44 | 0.66 | 0.18 | 0.46 | 0.27 | 0.82
4000 | 0.64 | 1.54 | 2.58 | 3.32 | 0.84 | 0.19 | 0.46 | 0.25 | 0.78
4500 | 0.72 | 1.56 | 2.78 | 3.42 | 0.96 | 0.21 | 0.46 | 0.28 | 0.82
5000 | 0.70 | 1.48 | 2.88 | 3.56 | 0.92 | 0.20 | 0.42 | 0.26 | 0.81
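
The "vs CNFA" columns appear to be each machine's time divided by the CNFA time for the same pattern length. That reading is an assumption inferred from the numbers, not something stated in the text, and the column names used below (in particular "opt. TNFA" and "unopt. CNFA") follow the same inference. The short Python check reproduces the ratios for the length 1000 row of the table above; the helper `ratios_vs_cnfa` is hypothetical.

```python
# Timings (seconds) for pattern (abc...) at length 1000, taken from the row above.
times_1000 = {
    "TNFA": 0.14,
    "opt. TNFA": 0.34,
    "unopt. CNFA": 0.58,
    "CNFA": 0.76,
    "MYNFA": 0.18,
}

def ratios_vs_cnfa(times):
    """Divide every machine's time by the CNFA time, rounded to 2 places."""
    cnfa = times["CNFA"]
    return {name: round(t / cnfa, 2)
            for name, t in times.items() if name != "CNFA"}

ratios = ratios_vs_cnfa(times_1000)
print(ratios)
```

Under this reading, a ratio below 1 means the other machine was faster than the CNFA on that input, and a ratio above 1 means the CNFA was faster.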

(abc...)*

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 0.32 | 0.62 | 1.10 | 1.56 | 0.32 | 0.21 | 0.40 | 0.21 | 0.71
20 | 0.26 | 0.60 | 1.12 | 1.44 | 0.34 | 0.18 | 0.42 | 0.24 | 0.78
30 | 0.28 | 0.64 | 1.12 | 1.42 | 0.36 | 0.20 | 0.45 | 0.25 | 0.79
40 | 0.28 | 0.64 | 1.08 | 1.44 | 0.34 | 0.19 | 0.44 | 0.24 | 0.75
50 | 0.26 | 0.64 | 1.08 | 1.48 | 0.38 | 0.18 | 0.43 | 0.26 | 0.73
60 | 0.28 | 0.64 | 1.12 | 1.50 | 0.34 | 0.19 | 0.43 | 0.23 | 0.75
70 | 0.26 | 0.66 | 1.10 | 1.46 | 0.34 | 0.18 | 0.45 | 0.23 | 0.75
80 | 0.26 | 0.64 | 1.10 | 1.46 | 0.32 | 0.18 | 0.44 | 0.22 | 0.75
90 | 0.28 | 0.64 | 1.14 | 1.46 | 0.36 | 0.19 | 0.44 | 0.25 | 0.78
100 | 0.26 | 0.64 | 1.10 | 1.46 | 0.34 | 0.18 | 0.44 | 0.23 | 0.75

(a|b|c...)*

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 5.46 | 1.36 | 4.20 | 1.76 | 0.82 | 3.10 | 0.77 | 0.47 | 2.39
20 | 10.48 | 2.18 | 7.52 | 2.02 | 1.38 | 5.19 | 1.08 | 0.68 | 3.72
30 | 15.70 | 3.04 | 10.86 | 2.18 | 1.86 | 4.85 | 1.39 | 0.85 | 4.98
40 | 21.16 | 3.76 | 14.28 | 2.56 | 2.42 | 8.27 | 1.47 | 0.95 | 5.58
50 | 26.22 | 4.60 | 17.28 | 2.84 | 3.00 | 9.23 | 1.62 | 1.06 | 6.08
60 | 31.62 | 5.46 | 22.56 | 3.12 | 3.66 | 10.13 | 1.75 | 1.17 | 7.23
70 | 36.62 | 6.20 | 23.94 | 3.26 | 4.36 | 11.23 | 1.90 | 1.34 | 7.34
80 | 42.02 | 7.12 | 27.38 | 3.56 | 5.22 | 11.24 | 2.00 | 1.47 | 7.69
90 | 47.94 | 7.92 | 30.44 | 3.90 | 6.00 | 12.29 | 2.03 | 1.54 | 7.81
100 | 52.00 | 8.70 | 35.10 | 4.10 | 6.88 | 12.68 | 2.12 | 1.68 | 8.56

2 All tests are performed on a SUN 3/250 server. Benchmark time is in seconds.

((a|λ)^n -)*

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 7.14 | 4.30 | 5.50 | 8.10 | 3.96 | 0.88 | 0.53 | 0.49 | 0.70
20 | 12.94 | 7.14 | 9.14 | 13.76 | 12.14 | 0.94 | 0.52 | 0.88 | 0.69
30 | 19.90 | 10.60 | 13.76 | 20.74 | 26.12 | 0.96 | 0.51 | 1.26 | 0.66
40 | 25.92 | 12.90 | 17.06 | 26.22 | 42.16 | 0.99 | 0.49 | 1.61 | 0.65
50 | 31.46 | 16.82 | 22.36 | 34.26 | 66.54 | 0.92 | 0.49 | 1.94 | 0.65
60 | 36.10 |  | 24.98 | 39.78 | 91.74 | 0.91 | 0.48 | 2.31 | 0.63
70 | 43.56 | 22.96 | 29.54 | 46.04 | 127.28 | 0.95 | 0.50 | 2.76 | 0.64
80 | 51.18 | 25.96 | 35.20 | 53.60 | 171.02 | 0.95 | 0.48 | 3.19 | 0.66
90 | 52.66 | 26.80 | 35.54 | 54.24 | 187.56 | 1.01 | 0.51 | 3.46 | 0.68
100 | 61.30 | 31.12 | 41.00 | 63.44 | 248.04 | 0.97 | 0.49 | 3.91 | 0.65

((a|λ)(b|λ)...)*

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 7.08 | 3.92 | 4.26 | 1.94 | 0.86 | 3.56 | 2.02 | 0.44 | 2.20
20 | 13.06 | 7.42 | 7.60 | 2.06 | 1.38 | 6.34 | 3.60 | 0.67 | 3.69
30 | 19.92 | 10.96 | 10.78 | 2.30 | 1.92 | 8.66 | 4.77 | 0.83 | 4.69
40 | 26.32 | 14.38 | 14.16 | 2.52 | 2.44 | 10.44 | 5.71 | 0.97 | 5.62
50 | 32.32 | 18.00 | 17.34 | 2.84 | 3.00 | 11.38 | 6.34 | 1.06 | 6.11
60 | 37.68 | 21.66 | 20.78 | 3.10 | 3.66 | 12.15 | 6.99 | 1.18 | 6.70
70 | 43.82 | 25.12 | 24.06 | 3.24 | 4.34 | 13.52 | 7.75 | 1.34 | 7.42
80 | 51.54 | 28.48 | 27.78 | 3.58 | 5.18 | 14.40 | 7.96 | 1.45 | 7.75
90 | 57.80 | 32.08 | 30.80 | 3.88 | 5.90 | 14.89 | 8.27 | 1.52 | 7.94
100 | 64.56 | 35.46 | 33.98 | 4.06 | 6.86 | 15.90 | 8.73 | 1.69 | 8.37

((a|λ)(b|λ)...-)*

length | TNFA | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA | unopt. CNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 4.40 | 2.82 | 2.66 | 3.36 | 0.68 | 1.31 | 0.84 | 0.20 | 0.79
20 | 8.08 | 4.94 | 4.08 | 4.86 | 1.00 | 1.72 | 1.02 | 0.21 | 0.83
30 | 12.30 | 7.16 | 5.50 | 6.54 | 1.34 | 1.88 | 1.09 | 0.20 | 0.84
40 | 16.06 | 9.22 | 6.72 | 7.96 | 1.64 | 2.01 | 1.16 | 0.21 | 0.87
50 | 19.22 | 11.24 | 8.10 | 9.34 | 1.88 | 2.05 | 1.20 | 0.20 | 0.87
60 | 23.46 | 13.90 | 9.80 | 11.04 | 2.38 | 2.13 | 1.26 | 0.22 | 0.89
70 | 27.32 | 16.18 | 11.22 | 12.68 | 2.84 | 2.15 | 1.28 | 0.22 | 0.89
80 | 31.90 | 18.18 | 12.72 | 14.34 | 3.16 | 2.22 | 1.27 | 0.22 | 0.89
90 | 34.80 | 19.92 | 13.64 | 15.34 | 3.40 | 2.27 | 1.30 | 0.22 | 0.89
100 | 38.98 | 22.04 | 14.96 | 18.50 | 3.78 | 2.10 | 1.19 | 0.20 | 0.81

Subset Construction Benchmark


(abc...)*


Construction Time

length | TNFA | TNFA (imp. state) | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA (imp. state) vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
100 | 0.04 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 | 1.00 | 1.00 | 1.00
200 | 0.12 | 0.04 | 0.06 | 0.06 | 0.04 | 0.06 | 1.00 | 1.50 | 1.50
300 | 0.16 | 0.06 | 0.04 | 0.06 | 0.04 | 0.08 | 1.50 | 1.00 | 2.00
400 | 0.28 | 0.10 | 0.08 | 0.08 | 0.10 | 0.08 | 1.00 | 0.80 | 0.80
500 | 0.40 | 0.12 | 0.10 | 0.12 | 0.12 | 0.10 | 1.00 | 0.83 | 0.83
600 | 0.56 | 0.12 | 0.14 | 0.14 | 0.14 | 0.14 | 0.86 | 1.00 | 1.00
700 | 0.74 | 0.14 | 0.18 | 0.14 | 0.14 | 0.20 | 1.00 | 1.29 | 1.43
800 | 1.04 | 0.20 | 0.20 | 0.20 | 0.18 | 0.24 | 1.11 | 1.11 | 1.33
900 | 1.12 | 0.18 | 0.22 | 0.22 | 0.18 | 0.22 | 1.00 | 1.22 | 1.22
1000 | 1.72 | 0.26 | 0.20 | 0.20 | 0.22 | 0.22 | 1.18 | 0.91 | 1.00

Constructed DFA Size

machine | length | node no | edge no | node weight | edge weight | length | node no | edge no | node weight | edge weight
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
TNFA | 100 | 101 | 101 | 105 | 103 | 200 | 201 | 201 | 205 | 203
TNFA (imp. state) | 100 | 101 | 101 | 101 | 101 | 200 | 201 | 201 | 201 | 201
opt. TNFA | 100 | 100 | 100 | 100 | 100 | 200 | 200 | 200 | 200 | 200
unopt. CNFA | 100 | 101 | 101 | 101 | 101 | 200 | 201 | 201 | 201 | 201
CNFA | 100 | 101 | 101 | 101 | 101 | 200 | 201 | 201 | 201 | 201
MYNFA | 100 | 101 | 101 | 101 | 101 | 200 | 201 | 201 | 201 | 201
TNFA | 300 | 301 | 301 | 305 | 303 | 400 | 401 | 401 | 405 | 403
TNFA (imp. state) | 300 | 301 | 301 | 301 | 301 | 400 | 401 | 401 | 401 | 401
opt. TNFA | 300 | 300 | 300 | 300 | 300 | 400 | 400 | 400 | 400 | 400
unopt. CNFA | 300 | 301 | 301 | 301 | 301 | 400 | 401 | 401 | 401 | 401
CNFA | 300 | 301 | 301 | 301 | 301 | 400 | 401 | 401 | 401 | 401
MYNFA | 300 | 301 | 301 | 301 | 301 | 400 | 401 | 401 | 401 | 401
TNFA | 500 | 501 | 501 | 505 | 503 | 600 | 601 | 601 | 605 | 603
TNFA (imp. state) | 500 | 501 | 501 | 501 | 501 | 600 | 601 | 601 | 601 | 601
opt. TNFA | 500 | 500 | 500 | 500 | 500 | 600 | 600 | 600 | 600 | 600
unopt. CNFA | 500 | 501 | 501 | 501 | 501 | 600 | 601 | 601 | 601 | 601
CNFA | 500 | 501 | 501 | 501 | 501 | 600 | 601 | 601 | 601 | 601
MYNFA | 500 | 501 | 501 | 501 | 501 | 600 | 601 | 601 | 601 | 601
TNFA | 700 | 701 | 701 | 705 | 703 | 800 | 801 | 801 | 805 | 803
TNFA (imp. state) | 700 | 701 | 701 | 701 | 701 | 800 | 801 | 801 | 801 | 801
opt. TNFA | 700 | 700 | 700 | 700 | 700 | 800 | 800 | 800 | 800 | 800
unopt. CNFA | 700 | 701 | 701 | 701 | 701 | 800 | 801 | 801 | 801 | 801
CNFA | 700 | 701 | 701 | 701 | 701 | 800 | 801 | 801 | 801 | 801
MYNFA | 700 | 701 | 701 | 701 | 701 | 800 | 801 | 801 | 801 | 801
TNFA | 900 | 901 | 901 | 905 | 903 | 1000 | 1001 | 1001 | 1005 | 1003
TNFA (imp. state) | 900 | 901 | 901 | 901 | 901 | 1000 | 1001 | 1001 | 1005 | 1003
opt. TNFA | 900 | 900 | 900 | 900 | 900 | 1000 | 1000 | 1000 | 1000 | 1000
unopt. CNFA | 900 | 901 | 901 | 901 | 901 | 1000 | 1001 | 1001 | 1005 | 1003
CNFA | 900 | 901 | 901 | 901 | 901 | 1000 | 1001 | 1001 | 1005 | 1003
MYNFA | 900 | 901 | 901 | 901 | 901 | 1000 | 1001 | 1001 | 1005 | 1003



(a|b|c...)*


Construction Time

length | TNFA | TNFA (imp. state) | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA (imp. state) vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
10 | 0.08 | 0.02 | 0.00 | 0.02 | 0.00 | 0.02 | - | 1.00 | -
20 | 0.92 | 0.06 | 0.00 | 0.06 | 0.02 | 0.04 | 3.00 | 0.00 | 2.00
30 | 4.44 | 0.08 | 0.02 | 0.08 | 0.02 | 0.08 | 4.00 | 1.00 | 4.00
40 | 12.70 | 0.14 | 0.00 | 0.18 | 0.00 | 0.16 | - | 1.00 | -
50 | 29.94 | 0.24 | 0.02 | 0.24 | 0.02 | 0.28 | 12.00 | 1.00 | 14.00
60 | 61.18 | 0.46 | 0.02 | 0.36 | 0.02 | 0.30 | 23.00 | 1.00 | 15.00
70 | 111.88 | 0.50 | 0.04 | 0.46 | 0.02 | 0.46 | 25.00 | 2.00 | 23.00
80 | 188.74 | 0.64 | 0.04 | 0.54 | 0.02 | 0.58 | 32.00 | 2.00 | 29.00
90 | 300.72 | 0.78 | 0.06 | 0.78 | 0.02 | 0.74 | 39.00 | 3.00 | 37.00
100 | 450.90 | 0.96 | 0.06 | 0.94 | 0.02 | 0.92 | 48.00 | 3.00 | 46.00

Constructed DFA Size

machine | length | node no | edge no | node weight | edge weight | length | node no | edge no | node weight | edge weight
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
TNFA | 10 | 11 | 110 | 285 | 2904 | 20 | 21 | 420 | 1070 | 21609
TNFA (imp. state) | 10 | 11 | 110 | 11 | 110 | 20 | 21 | 420 | 21 | 420
opt. TNFA | 10 | 1 | 10 | 1 | 10 | 20 | 1 | 20 | 1 | 20
unopt. CNFA | 10 | 11 | 110 | 11 | 110 | 20 | 21 | 420 | 21 | 420
CNFA | 10 | 2 | 20 | 2 | 20 | 20 | 2 | 40 | 2 | 40
MYNFA | 10 | 11 | 110 | 11 | 110 | 20 | 21 | 420 | 21 | 420
TNFA | 30 | 31 | 930 | 2355 | 71114 | 40 | 41 | 1640 | 4140 | 166419
TNFA (imp. state) | 30 | 31 | 930 | 31 | 930 | 40 | 41 | 1640 | 41 | 1640
opt. TNFA | 30 | 1 | 30 | 1 | 30 | 40 | 1 | 40 | 1 | 40
unopt. CNFA | 30 | 31 | 930 | 31 | 930 | 40 | 41 | 1640 | 41 | 1640
CNFA | 30 | 2 | 60 | 2 | 60 | 40 | 2 | 80 | 2 | 80
MYNFA | 30 | 31 | 930 | 31 | 930 | 40 | 41 | 1640 | 41 | 1640
TNFA | 50 | 51 | 2550 | 6425 | 322524 | 60 | 61 | 3660 | 9210 | 554429
TNFA (imp. state) | 50 | 51 | 2550 | 51 | 2550 | 60 | 61 | 3660 | 61 | 3660
opt. TNFA | 50 | 1 | 50 | 1 | 50 | 60 | 1 | 60 | 1 | 60
unopt. CNFA | 50 | 51 | 2550 | 51 | 2550 | 60 | 61 | 3660 | 61 | 3660
CNFA | 50 | 2 | 100 | 2 | 100 | 60 | 2 | 120 | 2 | 120
MYNFA | 50 | 51 | 2550 | 51 | 2550 | 60 | 61 | 3660 | 61 | 3660
TNFA | 70 | 71 | 4970 | 12495 | 877134 | 80 | 81 | 6480 | 16280 | 1305639
TNFA (imp. state) | 70 | 71 | 4970 | 71 | 4970 | 80 | 81 | 6480 | 81 | 6480
opt. TNFA | 70 | 1 | 70 | 1 | 70 | 80 | 1 | 80 | 1 | 80
unopt. CNFA | 70 | 71 | 4970 | 71 | 4970 | 80 | 81 | 6480 | 81 | 6480
CNFA | 70 | 2 | 140 | 2 | 140 | 80 | 2 | 160 | 2 | 160
MYNFA | 70 | 71 | 4970 | 71 | 4970 | 80 | 81 | 6480 | 81 | 6480
TNFA | 90 | 91 | 8190 | 20565 | 1854944 | 100 | 101 | 10100 | 25350 | 2540049
TNFA (imp. state) | 90 | 91 | 8190 | 91 | 8190 | 100 | 101 | 10100 | 101 | 10100
opt. TNFA | 90 | 1 | 90 | 1 | 90 | 100 | 1 | 100 | 1 | 100
unopt. CNFA | 90 | 91 | 8190 | 91 | 8190 | 100 | 101 | 10100 | 101 | 10100
CNFA | 90 | 2 | 180 | 2 | 180 | 100 | 2 | 200 | 2 | 200
MYNFA | 90 | 91 | 8190 | 91 | 8190 | 100 | 101 | 10100 | 101 | 10100



Construction Time

length | TNFA | TNFA (imp. state) | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA (imp. state) vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
25 | 1.54 | 0.28 | 0.04 | 0.20 | 0.02 | 0.26 | 14.00 | 2.00 | 13.00
50 | 3.16 | 0.48 | 0.06 | 0.44 | 0.04 | 0.52 | 12.00 | 1.50 | 13.00
75 | 4.82 | 0.78 | 0.08 | 0.68 | 0.06 | 0.80 | 13.00 | 1.33 | 13.33
100 | 6.50 | 1.02 | 0.12 | 0.90 | 0.12 | 1.00 | 8.50 | 1.00 | 8.33
125 | 8.16 | 1.26 | 0.14 | 1.08 | 0.10 | 1.20 | 12.60 | 1.40 | 12.00
150 | 9.80 | 1.60 | 0.18 | 1.38 | 0.12 | 1.54 | 13.33 | 1.50 | 12.83
175 | 11.42 | 1.70 | 0.20 | 1.46 | 0.14 | 1.72 | 12.14 | 1.43 | 12.29
200 | 13.20 | 2.06 | 0.24 | 1.74 | 0.18 | 1.94 | 11.44 | 1.33 | 10.78
225 | 14.98 | 2.34 | 0.24 | 2.28 | 0.20 | 2.32 | 11.70 | 1.20 | 11.60
250 | 16.30 | 2.58 | 0.28 | 2.52 | 0.22 | 2.70 | 11.73 | 1.27 | 12.27

Constructed DFA Size

machine | length | node no | edge no | node weight | edge weight | length | node no | edge no | node weight | edge weight
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
TNFA | 25 | 251 | 2410 | 5939 | 57004 | 50 | 501 | 4910 | 12039 | 118004
TNFA (imp. state) | 25 | 251 | 2410 | 251 | 2410 | 50 | 501 | 4910 | 501 | 4910
opt. TNFA | 25 | 26 | 250 | 26 | 250 | 50 | 51 | 500 | 51 | 500
unopt. CNFA | 25 | 251 | 2410 | 251 | 2410 | 50 | 501 | 4910 | 501 | 4910
CNFA | 25 | 26 | 250 | 26 | 250 | 50 | 51 | 500 | 51 | 500
MYNFA | 25 | 251 | 2410 | 251 | 2410 | 50 | 501 | 4910 | 501 | 4910
TNFA | 75 | 751 | 7410 | 18139 | 179004 | 100 | 1001 | 9910 | 24239 | 240004
TNFA (imp. state) | 75 | 751 | 7410 | 751 | 7410 | 100 | 1001 | 9910 | 1001 | 9910
opt. TNFA | 75 | 76 | 750 | 76 | 750 | 100 | 101 | 1000 | 101 | 1000
unopt. CNFA | 75 | 751 | 7410 | 751 | 7410 | 100 | 1001 | 9910 | 1001 | 9910
CNFA | 75 | 76 | 750 | 76 | 750 | 100 | 101 | 1000 | 101 | 1000
MYNFA | 75 | 751 | 7410 | 751 | 7410 | 100 | 1001 | 9910 | 1001 | 9910
TNFA | 125 | 1251 | 12410 | 30339 | 301004 | 150 | 1501 | 14910 | 36439 | 362004
TNFA (imp. state) | 125 | 1251 | 12410 | 1251 | 12410 | 150 | 1501 | 14910 | 1501 | 14910
opt. TNFA | 125 | 126 | 1250 | 126 | 1250 | 150 | 151 | 1500 | 151 | 1500
unopt. CNFA | 125 | 1251 | 12410 | 1251 | 12410 | 150 | 1501 | 14910 | 1501 | 14910
CNFA | 125 | 126 | 1250 | 126 | 1250 | 150 | 151 | 1500 | 151 | 1500
MYNFA | 125 | 1251 | 12410 | 1251 | 12410 | 150 | 1501 | 14910 | 1501 | 14910
TNFA | 175 | 1751 | 17410 | 42539 | 423004 | 200 | 2001 | 19910 | 48639 | 484004
TNFA (imp. state) | 175 | 1751 | 17410 | 1751 | 17410 | 200 | 2001 | 19910 | 2001 | 19910
opt. TNFA | 175 | 176 | 1750 | 176 | 1750 | 200 | 201 | 2000 | 201 | 2000
unopt. CNFA | 175 | 1751 | 17410 | 1751 | 17410 | 200 | 2001 | 19910 | 2001 | 19910
CNFA | 175 | 176 | 1750 | 176 | 1750 | 200 | 201 | 2000 | 201 | 2000
MYNFA | 175 | 1751 | 17410 | 1751 | 17410 | 200 | 2001 | 19910 | 2001 | 19910
TNFA | 225 | 2251 | 22410 | 54739 | 545004 | 250 | 2501 | 24910 | 60839 | 606004
TNFA (imp. state) | 225 | 2251 | 22410 | 2251 | 22410 | 250 | 2501 | 24910 | 2501 | 24910
opt. TNFA | 225 | 226 | 2250 | 226 | 2250 | 250 | 251 | 2500 | 251 | 2500
unopt. CNFA | 225 | 2251 | 22410 | 2251 | 22410 | 250 | 2501 | 24910 | 2501 | 24910
CNFA | 225 | 226 | 2250 | 226 | 2250 | 250 | 251 | 2500 | 251 | 2500
MYNFA | 225 | 2251 | 22410 | 2251 | 22410 | 250 | 2501 | 24910 | 2501 | 24910

((a|λ)(b|λ)...-)*


Construction Time

length | TNFA | TNFA (imp. state) | opt. TNFA | unopt. CNFA | CNFA | MYNFA | TNFA (imp. state) vs CNFA | opt. TNFA vs CNFA | MYNFA vs CNFA
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
25 | 0.34 | 0.04 | 0.10 | 0.04 | 0.04 | 0.04 | 1.00 | 2.50 | 1.00
50 | 3.86 | 0.14 | 0.40 | 0.12 | 0.12 | 0.14 | 1.17 | 3.33 | 1.17
75 | 18.16 | 0.30 | 1.34 | 0.24 | 0.28 | 0.28 | 1.07 | 4.79 | 1.00
100 | 55.50 | 0.48 | 2.90 | 0.44 | 0.46 | 0.50 | 1.04 | 6.30 | 1.09
125 | 132.52 | 0.80 | 5.72 | 0.66 | 0.76 | 0.74 | 1.05 | 7.53 | 0.97
150 | 270.34 | 1.14 | 9.24 | 0.96 | 1.02 | 1.06 | 1.12 | 9.06 | 1.04
175 | 496.94 | 1.54 | 14.64 | 1.26 | 1.38 | 1.50 | 1.12 | 10.61 | 1.09
200 | 839.94 | 2.04 | 20.94 | 1.62 | 1.76 | 1.88 | 1.16 | 11.90 | 1.07
225 | 1392.42 | 3.20 | 32.02 | 2.12 | 2.40 | 2.44 | 1.33 | 13.34 | 1.02
250 | 2065.14 | 3.16 | 41.62 | 2.78 | 2.74 | 2.90 | 1.15 | 15.19 | 1.06

Constructed DFA Size

machine | length | node no | edge no | node weight | edge weight | length | node no | edge no | node weight | edge weight
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
TNFA | 25 | 27 | 351 | 1027 | 8476 | 50 | 52 | 1281 | 3928 | 65076
TNFA (imp. state) | 25 | 27 | 351 | 27 | 351 | 50 | 52 | 1281 | 53 | 1326
opt. TNFA | 25 | 27 | 351 | 27 | 351 | 50 | 52 | 1281 | 53 | 1326
unopt. CNFA | 25 | 27 | 351 | 27 | 351 | 50 | 52 | 1281 | 53 | 1326
CNFA | 25 | 27 | 351 | 27 | 351 | 50 | 52 | 1281 | 53 | 1326
MYNFA | 25 | 27 | 351 | 27 | 351 | 50 | 52 | 1281 | 53 | 1326
TNFA | 75 | 77 | 2881 | 8703 | 216676 | 100 | 102 | 5106 | 15353 | 510151
TNFA (imp. state) | 75 | 77 | 2881 | 78 | 2926 | 100 | 102 | 5106 | 103 | 5151
opt. TNFA | 75 | 77 | 2881 | 78 | 2926 | 100 | 102 | 5106 | 103 | 5151
unopt. CNFA | 75 | 77 | 2881 | 78 | 2926 | 100 | 102 | 5106 | 103 | 5151
CNFA | 75 | 77 | 2881 | 78 | 2926 | 100 | 102 | 5106 | 103 | 5151
MYNFA | 75 | 77 | 2881 | 78 | 2926 | 100 | 102 | 5106 | 103 | 5151
TNFA | 125 | 127 | 7956 | 23878 | 992376 | 150 | 152 | 11431 | 34278 | 1710226
TNFA (imp. state) | 125 | 127 | 7956 | 128 | 8001 | 150 | 152 | 11431 | 153 | 11476
opt. TNFA | 125 | 127 | 7956 | 128 | 8001 | 150 | 152 | 11431 | 153 | 11476
unopt. CNFA | 125 | 127 | 7956 | 128 | 8001 | 150 | 152 | 11431 | 153 | 11476
CNFA | 125 | 127 | 7956 | 128 | 8001 | 150 | 152 | 11431 | 153 | 11476
MYNFA | 125 | 127 | 7956 | 128 | 8001 | 150 | 152 | 11431 | 153 | 11476
TNFA | 175 | 177 | 15531 | 46553 | 2710576 | 200 | 202 | 20256 | 60703 | 4040301
TNFA (imp. state) | 175 | 177 | 15531 | 178 | 15576 | 200 | 202 | 20256 | 203 | 20301
opt. TNFA | 175 | 177 | 15531 | 178 | 15576 | 200 | 202 | 20256 | 203 | 20301
unopt. CNFA | 175 | 177 | 15531 | 178 | 15576 | 200 | 202 | 20256 | 203 | 20301
CNFA | 175 | 177 | 15531 | 178 | 15576 | 200 | 202 | 20256 | 203 | 20301
MYNFA | 175 | 177 | 15531 | 178 | 15576 | 200 | 202 | 20256 | 203 | 20301
TNFA | 225 | 227 | 25606 | 76728 | 5746276 | 250 | 252 | 31581 | 94628 | 7875376
TNFA (imp. state) | 225 | 227 | 25606 | 228 | 25651 | 250 | 252 | 31581 | 253 | 31626
opt. TNFA | 225 | 227 | 25606 | 228 | 25651 | 250 | 252 | 31581 | 253 | 31626
unopt. CNFA | 225 | 227 | 25606 | 228 | 25651 | 250 | 252 | 31581 | 253 | 31626
CNFA | 225 | 227 | 25606 | 228 | 25651 | 250 | 252 | 31581 | 253 | 31626
MYNFA | 225 | 227 | 25606 | 228 | 25651 | 250 | 252 | 31581 | 253 | 31626


((a|λ)(b|λ)...)*


Construction Time

length     TNFA  TNFA (imp. state)  opt. TNFA  unopt. CNFA  CNFA  MYNFA  TNFA (imp. state) vs CNFA  opt. TNFA vs CNFA  MYNFA vs CNFA
    10     0.14               0.02       0.04         0.02  0.00   0.02                          -                  -              -
    20     1.44               0.04       0.14         0.04  0.00   0.06                          -                  -              -
    30     6.24               0.10       0.28         0.10  0.02   0.06                       5.00              14.00           3.00
    40    19.20               0.16       0.58         0.14  0.00   0.18                          -                  -              -
    50    45.64               0.26       1.10         0.22  0.02   0.24                      13.00              55.00          12.00
    60    93.00               0.36       1.84         0.32  0.02   0.36                      18.00              92.00          18.00
    70   170.80               0.48       2.88         0.42  0.02   0.48                      24.00             144.00          24.00
    80   287.48               0.68       4.24         0.56  0.02   0.62                      34.00             212.00          31.00
    90   457.08               0.78       5.88         0.72  0.02   0.74                      39.00             294.00          37.00
   100   693.54               0.90       8.00         0.92  0.02   1.06                      45.00             400.00          53.00

Constructed DFA Size

machine            length  node no  edge no  node weight  edge weight  length  node no  edge no  node weight  edge weight
TNFA                   10       11      110          363         3630      20       21      420         1323        26460
TNFA (imp. state)      10       11      110           11          110      20       21      420           21          420
opt. TNFA              10       10      100           10          100      20       20      400           20          400
unopt. CNFA            10       11      110           11          110      20       21      420           21          420
CNFA                   10        2       20            2           20      20        2       40            2           40
MYNFA                  10       11      110           11          110      20       21      420           21          420

TNFA                   30       31      930         2883        86490      40       41     1640         5043       201720
TNFA (imp. state)      30       31      930           31          930      40       41     1640           41         1640
opt. TNFA              30       30      900           30          900      40       40     1600           40         1600
unopt. CNFA            30       31      930           31          930      40       41     1640           41         1640
CNFA                   30        2       60            2           60      40        2       80            2           80
MYNFA                  30       31      930           31          930      40       41     1640           41         1640

TNFA                   50       51     2550         7803       390150      60       61     3660        11163       669780
TNFA (imp. state)      50       51     2550           51         2550      60       61     3660           61         3660
opt. TNFA              50       50     2500           50         2500      60       60     3600           60         3600
unopt. CNFA            50       51     2550           51         2550      60       61     3660           61         3660
CNFA                   50        2      100            2          100      60        2      120            2          120
MYNFA                  50       51     2550           51         2550      60       61     3660           61         3660

TNFA                   70       71     4970        15123      1058610      80       81     6480        19683      1574640
TNFA (imp. state)      70       71     4970           71         4970      80       81     6480           81         6480
opt. TNFA              70       70     4900           70         4900      80       80     6400           80         6400
unopt. CNFA            70       71     4970           71         4970      80       81     6480           81         6480
CNFA                   70        2      140            2          140      80        2      160            2          160
MYNFA                  70       71     4970           71         4970      80       81     6480           81         6480

TNFA                   90       91     8190        24843      2235870     100      101    10100        30603      3060300
TNFA (imp. state)      90       91     8190           91         8190     100      101    10100          101        10100
opt. TNFA              90       90     8100           90         8100     100      100    10000          100        10000
unopt. CNFA            90       91     8190           91         8190     100      101    10100          101        10100
CNFA                   90        2      180            2          180     100        2      200            2          200
MYNFA                  90       91     8190           91         8190     100      101    10100          101        10100


(a|b)*a(a|b)^n


Construction Time

length     TNFA  TNFA (imp. state)  opt. TNFA  unopt. CNFA  CNFA  MYNFA  TNFA (imp. state) vs CNFA  opt. TNFA vs CNFA  MYNFA vs CNFA
     1     0.00               0.00       0.00         0.00  0.00   0.00                       0.00               1.00           1.00
     2     0.02               0.02       0.02         0.00  0.02   0.00                       1.00               1.00           0.00
     3     0.04               0.02       0.02         0.00  0.02   0.00                       1.00               1.00           0.00
     4     0.02               0.02       0.02         0.00  0.02   0.02                       1.00               1.00           1.00
     5     0.06               0.02       0.04         0.02  0.02   0.04                       1.00               2.00           2.00
     6     0.16               0.04       0.06         0.06  0.06   0.08                       0.67               1.00           1.33
     7     0.30               0.14       0.12         0.14  0.14   0.14                       1.00               0.86           1.00
     8     0.70               0.26       0.22         0.30  0.28   0.36                       0.93               0.79           1.29
     9     1.58               0.54       0.50         0.76  0.56   0.94                       0.96               0.89           1.68
    10     3.54               1.14       1.06         2.26  1.22   3.34                       0.93               0.87           2.74

Constructed DFA Size

machine            length  node no  edge no  node weight  edge weight  length  node no  edge no  node weight  edge weight
TNFA                    1        5       10           39           83       2        9       18           89          183
TNFA (imp. state)       1        5       10            9           19       2        9       18           21           43
opt. TNFA               1        ?        ?            ?            ?       2        8       16           20           40
unopt. CNFA             1        5       10            9           19       2        9       18           21           43
CNFA                    1        5       10            9           19       2        9       18           21           43
MYNFA                   1        5       10            9           19       2        9       18           21           43

TNFA                    3       17       34          205          415       4       33       66          469          943
TNFA (imp. state)       3       17       34           49           99       4       33       66          113          227
opt. TNFA               3       16       32           48           96       4       32       64          112          224
unopt. CNFA             3       17       34           49           99       4       33       66          113          227
CNFA                    3       17       34           49           99       4       33       66          113          227
MYNFA                   3       17       34           49           99       4       33       66          113          227

TNFA                    5       65      130         1061         2127       6      129      258         2373         4751
TNFA (imp. state)       5       65      130          257          515       6      129      258          577         1155
opt. TNFA               5       64      128          256          512       6      128      256          576         1152
unopt. CNFA             5       65      130          257            ?       6      129      258          577         1155
CNFA                    5       65      130          257            ?       6      129      258          577         1155
MYNFA                   5       65      130          257            ?       6      129      258          577         1155

TNFA                    7      257        ?         5253        10511       8      513     1026        11525        23055
TNFA (imp. state)       7      257        ?         1281         2563       8      513     1026         2817         5635
opt. TNFA               7      256        ?         1280         2560       8      512     1024         2816         5632
unopt. CNFA             7      257        ?         1281         2563       8      513     1026         2817         5635
CNFA                    7      257        ?         1281         2563       8      513     1026         2817         5635
MYNFA                   7      257        ?         1281         2563       8      513     1026         2817         5635

TNFA                    9     1025     2050        25093        50191      10     2049     4098        54277       108559
TNFA (imp. state)       9     1025     2050         6145        12291      10     2049     4098        13313        26627
opt. TNFA               9     1024     2048         6144        12288      10     2048     4096        13312        26624
unopt. CNFA             9     1025     2050         6145        12291      10     2049     4098        13313        26627
CNFA                    9     1025     2050         6145        12291      10     2049     4098        13313        26627
MYNFA                   9     1025     2050         6145        12291      10     2049     4098        13313        26627
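
As a cross-check on the MYNFA rows of this table, the Python sketch below builds a position automaton (one NFA state per symbol occurrence plus a start state) for (a|b)*a(a|b)^n and runs a plain subset construction over it. It is an illustration only, not the implementation that was measured. The function names are hypothetical, and the sketch assumes that "node weight" is the total size of the NFA state sets labelling the DFA nodes and that "edge weight" is the total size of the target sets summed over all DFA edges. Under those assumptions it yields 5 nodes, 10 edges, node weight 9 and edge weight 19 for length 1, which agrees with the table.

```python
from collections import deque


def position_nfa(n):
    """Position automaton for (a|b)*a(a|b)^n (hypothetical helper).

    State 0 is the start state; every other state is one symbol occurrence.
    symbol_of[p] is the letter read on entering p, and follow[q] is the set
    of positions that may come directly after q. Accepting states are not
    tracked because only the size of the DFA is measured here.
    """
    symbol_of = {1: "a", 2: "b", 3: "a"}      # (a|b)* gives 1, 2; the middle a is 3
    loop = {1, 2, 3}
    follow = {0: set(loop), 1: set(loop), 2: set(loop), 3: set()}
    prev = [3]
    for i in range(n):                         # one (a|b) factor per iteration
        pa, pb = 4 + 2 * i, 5 + 2 * i
        symbol_of[pa], symbol_of[pb] = "a", "b"
        follow[pa], follow[pb] = set(), set()
        for q in prev:
            follow[q] |= {pa, pb}
        prev = [pa, pb]
    return symbol_of, follow


def subset_construction(symbol_of, follow, alphabet="ab"):
    """Return the reachable DFA states (frozensets) and the list of edges."""
    start = frozenset({0})
    seen = {start}
    edges = []
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for c in alphabet:
            target = frozenset(
                p for q in state for p in follow[q] if symbol_of[p] == c
            )
            if not target:                     # no dead state is materialised
                continue
            edges.append((state, c, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, edges


def dfa_size(n):
    """(node no, edge no, node weight, edge weight) under the stated assumptions."""
    states, edges = subset_construction(*position_nfa(n))
    node_weight = sum(len(s) for s in states)
    edge_weight = sum(len(t) for _, _, t in edges)
    return len(states), len(edges), node_weight, edge_weight


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, dfa_size(n))
```

The node counts in the table follow 2^(n+1) + 1 (5, 9, 17, 33, ...), so a sketch like this is only practical for small n.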



